ABSTRACT


In modern parallel architectures, memory accesses represent 
a common bottleneck. Thus, optimizing the way applica- 
tions access the memory is an important way to improve 
performance and energy consumption. Memory accesses are 
even more important with NUMA machines, as the access 
time to data depends on its location in the memory. Many 
efforts were made to develop adaptive tools that improve memory
accesses at runtime by optimizing the mapping of
data and threads to NUMA nodes. However, these tools
are not able to change the memory access pattern of the
original application; code written without considering
memory performance might therefore not benefit from them.
Moreover, automatic mapping tools take time to converge
towards the best mapping, losing optimization opportunities.
A deeper understanding of the memory behavior can
help optimize it, removing the need for runtime analysis.

In this paper, we present TABARNAC, a tool for an- 
alyzing the memory behavior of parallel applications with 
a focus on NUMA architectures. TABARNAC provides a 
new visualization of the memory access behavior, focusing 
on the distribution of accesses by thread and by structure. 
Such visualization allows the developer to easily understand 
why performance issues occur and how to fix them. Using 
TABARNAC, we explain why some applications do not ben- 
efit from data and thread mapping. Moreover, we propose 
several code modifications to improve the memory access 
behavior of several parallel applications. 

1. INTRODUCTION


Using memory on modern parallel shared-memory sys- 
tems with a Non-Uniform Memory Access (NUMA) behav- 
ior is both trivial and extremely complex: an application 
is able to access the whole memory with the same inter- 
face, but to use it efficiently, the developer needs to take 
several performance factors into account, such as the cache 
hierarchy and the structure of the NUMA architecture [14]. 
NUMA machines are characterized by multiple memory con- 
trollers per system [3], dividing the physical main memory 
into several NUMA nodes. Each node can access its local 
memory directly, but has to transfer data through an inter- 
connection network to access memory on remote nodes. Cur- 
rent systems usually have one memory controller per socket, 
but architectures with multiple controllers per socket are 
becoming more common [2]. In NUMA systems, decisions 
about where to place the data that a parallel application 
uses have a significant impact on the overall performance, 
with most policies aiming at improving the locality of mem- 
ory accesses [11]. 

The optimal mapping of memory pages to NUMA nodes 
depends on the way an application accesses the memory. To 
improve the mapping without changing the application, sev- 
eral automatic tools were proposed [7, 8, 12, 32]. However, 
these tools have a runtime overhead as they need to analyze 
the application behavior during execution and lose opportu- 
nities for improvements during this training. Furthermore, 
they are not able to change the memory access pattern for 
additional improvements. Therefore, if the memory behav- 
ior is not designed for NUMA machines, their improvements 
might be limited. For instance, if all threads are accessing 
data from a single memory page, remote memory accesses 
will be triggered from all NUMA nodes but one, wherever 
the page is mapped. This kind of issue can only be solved 
by modifying the memory access behavior in the source code 
of the application, requiring a deep understanding of its be- 
havior. 

Several tools, such as Intel’s VTune [33] and Performance
Counter Monitor (PCM) [19], the HPCToolkit [1], and AMD’s
CodeAnalyst [16], can be used to help the developer understand
and improve the performance of parallel applications.
However, these tools rely on hardware performance counters 
and can therefore provide only indirect and sampled infor- 
mation about the memory access behavior, through cache 
miss statistics, for example. Indeed, tracing the memory be- 
havior is complex, as many instructions trigger at least one 
memory access. Several studies have addressed this problem
using sampling [23, 31, 18]. They can find out what happens
(remote accesses, cache misses, etc.) and where (data structure,
line of code), but not how data structures are accessed and
shared by the different threads (which is what causes the
remote accesses).

In this paper, we present TABARNAC¹, a set of Tools for
Analyzing the Behavior of Applications Running on NUMA 
ArChitectures. TABARNAC provides tools to trace and vi- 
sualize the memory access behavior of parallel applications. 
More precisely, it helps to understand why performance is- 
sues occur by providing information on how data structures 
are accessed and shared by the different threads. Since it is 
based on memory access traces, TABARNAC has a very
high accuracy while maintaining a reasonable overhead that 
enables the analysis of large applications. In an evaluation 
with several parallel applications, we show that relatively 
small code changes suggested by TABARNAC can substan- 
tially improve application performance. 

The rest of this paper is organized as follows. In the next 
section, we discuss related work and compare it to our pro- 
posal. Section 3 presents the design and implementation 
of TABARNAC. Our evaluation methodology is outlined 
in Section 4. We show example analyses and performance 
improvements with TABARNAC using several parallel ap- 
plications in Section 5. Finally, we present our conclusions 
and discuss ideas for future work in Section 6. 


2. RELATED WORK 


This section presents an overview of related work in the 
area of memory access profiling for parallel applications based 
on shared memory. We also discuss some mechanisms to im- 
prove performance on NUMA architectures. 


2.1 Memory Profiling 


Generic tools to evaluate parallel application performance,
such as Intel’s VTune [33] and Performance Counter Monitor
(PCM) [19], the HPCToolkit [1], and AMD’s CodeAnalyst [16],
provide only indirect information about the
memory access behavior. More specific tools are therefore
required to improve it.

Profiling memory behavior raises two major challenges.
The first one is the collection of accurate and detailed in- 
formation: performance counters provide precise and easy 
access to statistics about the CPU usage, but there are few 
such mechanisms for the memory. For a maximum level of 
detail, memory access traces need to be created. The sec- 
ond challenge is the amount of information that needs to be 
interpreted and presented to the developer. Memory access 
traces provide huge amounts of information on several di- 
mensions: data structure, threads, access type (read/write), 
sharing, and time of access. Presenting them to the de- 
veloper in a readable and meaningful way is therefore not 
trivial. 


¹ TABARNAC is available at:
https://github.com/dbeniamine/Tabarnac 


2.1.1 Data Collection 


Several methods have been used to address the problem
of data collection. Many studies deduce information from
hardware performance counters [28, 20, 5, 36, 35, 10], which
are special registers that record events such as cache
misses and remote memory accesses. However, these counters
only provide a partial view of the execution: they show
memory-related events happening on the processor, but
not what triggered them. Moreover, most available performance
counters depend on the architecture, so it is
hard to reproduce the same analysis on different machines
with these tools.

Another approach used by several tools [23, 31, 25, 18] 
consists of using sampling mechanisms such as AMD’s In- 
struction Based Sampling (IBS) [15] or Intel Precise Event 
Based Sampling (PEBS) [24] to analyze applications. Not
only can sampling miss important events, leading to inaccurate
characterizations, but these technologies are usually not
portable and work only with a few recent architectures. Such
tools can therefore only be used in special circumstances.

Other studies use hardware modifications (with or without
simulation) [4, 30]. Although they provide more efficient
trace collection than tools implemented purely in software,
they are even less portable. Finally, binary instrumentation
can provide information about memory access behavior [9].
Although this method is slower than the previously
described techniques, it is more portable and precise. Moreover,
as we show in Section 5.3, an efficient instrumentation
can keep the overhead acceptable.


2.1.2 Visualization 


The second difficulty of memory analysis is to present the 
information in such a way that the developer can use it 
to improve the application. Some of the tools previously 
mentioned only provide a textual output [23, 31, 30]. Even 
if these tools highlight the most relevant information, it
is hard to get an overview of the memory behavior from 
such output. The developer might be presented with a huge 
amount of information and not be able to differentiate nor- 
mal behaviors from problematic ones. 

Other tools provide more advanced visualizations. For in- 
stance, Tao et al. [35] propose a detailed view of each mem- 
ory page, showing the number of remote and local accesses 
from each NUMA node. Weyers et al. [36] depict the mem- 
ory bandwidth between each pair of nodes, showing where 
the remote accesses occur. Other tools [10, 9, 5] provide 
several views of the execution, giving the ability to corre- 
late them with the source code of applications, similar to 
traditional performance tools such as VTune. Although all 
these tools can help developers understand the kind of per- 
formance issues they are facing, they do not give the reason 
why a particular issue is happening, for instance by showing 
the distribution of memory accesses within data structures. 

MemAxes [18] is one of the most advanced NUMA-oriented 
visualization tools. Figure 1 shows a screenshot of this tool 
on an example trace. It shows the source code of the appli- 
cation (left upper side), the NUMA hierarchy of the machine 
(right upper side) and a parallel coordinate graph (lower side) 
designed to help correlate information. Although this visualization
is designed to help understand NUMA performance
issues, it shows which event occurs and where it
occurs, but does not tell directly why it occurs. The user
still has to correlate several pieces of information to guess
the source of a performance issue.

Figure 1: Screenshot from MemAxes on the example data
trace provided with the tool.

Finally, the proposal of Liu et al. [25] is quite similar to 
the previous studies, but they also provide an address centric 
visualization, which shows how much each thread accesses a 
data structure. Such a visualization is a bit closer to provid- 
ing the source of the performance issue, but it does not show 
how the accesses are distributed inside a structure, and how 
the structure is shared between the threads. 


2.2 Data Mapping Mechanisms 


On NUMA architectures, data mapping mechanisms have 
the goal of improving the locality and balance of memory 
access between NUMA nodes. Traditionally, operating sys- 
tems have used the first-touch [29], next-touch [26] and inter- 
leave [22] policies to map memory pages to NUMA nodes. 
The first-touch policy, which is the default policy in most 
operating systems (such as Linux), allocates a page on the 
NUMA node that performs the first memory access to it. It 
requires the developer to take care of which thread accesses 
data first, as an incorrect first access can hurt performance. 
In next-touch [26], each page is periodically migrated to the 
NUMA node that performs the next access to a page. This 
technique is more flexible than first-touch, but can lead to 
excessive page migrations. The interleave policy (available 
in Linux via the numactl tool [22]) distributes memory
pages cyclically among all NUMA nodes, to improve load 
balance among memory controllers, but it does not take any 
locality into account. 
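
The difference between the first-touch and interleave policies can be illustrated with a small model of where a page ends up. This is only a sketch and not kernel code: the function names are hypothetical, and first_touch_node assumes that thread t is pinned to node t / threads_per_node, which the policies themselves do not guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Interleave: pages are spread cyclically over the nodes,
 * independently of which thread uses them. */
int interleave_node(size_t page, int nnodes) {
    assert(nnodes > 0);
    return (int)(page % (size_t)nnodes);
}

/* First-touch: the page is placed on the node of the first thread
 * that accesses it. Assumption: thread t is pinned to node
 * t / threads_per_node. */
int first_touch_node(int first_thread, int threads_per_node) {
    assert(first_thread >= 0 && threads_per_node > 0);
    return first_thread / threads_per_node;
}

/* Number of pages among [0, npages) that interleave places on `node`. */
size_t interleaved_pages_on_node(size_t npages, int nnodes, int node) {
    size_t count = 0;
    for (size_t p = 0; p < npages; p++)
        if (interleave_node(p, nnodes) == node)
            count++;
    return count;
}
```

In this model, interleave balances pages across nodes whatever the access pattern is, whereas first-touch places a page well only if its first access comes from the thread that will use it most.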

Newer developments in operating systems focus on refin- 
ing the data mapping during the execution of parallel appli- 
cations, using online profiling [12, 8]. Recent versions of the 
Linux kernel contain the NUMA Balancing technique [7], 
which uses page faults to determine if a page should be mi- 
grated to a different NUMA node. Other solutions improve 
the data mapping in the compiler, the runtime or at a li- 
brary level. Piccoli et al. [32] propose a compiler extension 
that analyzes the memory access patterns of parallel loops
and uses this information to migrate pages before executing 
the loop. 

Libraries such as libnuma [22] and MAi [34] provide the 
ability to allocate data structures on a particular NUMA 
node, or with an interleave policy. These techniques can 
achieve large improvements, but require a deep understand- 
ing of the applications’ memory behavior to use them effi- 
ciently. Our study provides tools to easily understand the 
memory behavior and therefore enable the developer to im- 
prove performance significantly. 


2.3 Summary of Related Work 


Several studies already provide tools to analyze memory 
accesses. These tools usually point out which performance 
issues occur (such as a high number of cache misses or ac- 
cesses to remote NUMA nodes), sometimes where they occur 
(such as information about the structure, function, or line 
of code). Some tools help to correlate this information
to guess why an issue is happening. However, no tool directly
provides the reasons why such issues occur and how
to fix them. Two types of information can help answer
this question: which thread is responsible for the first
touch (as the default page mapping of most operating sys- 
tems depends on it), and how different threads access data 
structures. This study presents TABARNAC, a set of tools 
to explain why performance issues related to memory occur 
and how they can be resolved. 


3. TABARNAC 


TABARNAC (Tools for Analyzing the Behavior of Applications
Running on NUMA ArChitectures) is divided into
two parts: the instrumentation tool, which collects information
about memory accesses, and the visualization, which
presents a meaningful interpretation of the trace. In this 
section, we discuss the implementation of both parts. 


3.1 Collecting Memory Access Information 


TABARNAC’s data collection aims at providing information
on how data structures are accessed; it therefore needs
to collect fine-grained information. To do so, we instrument
memory accesses and collect the number of accesses per page by
thread and type (read/write). The information is stored
on a per-thread basis, as shown in Listing 1, making the
code completely lock-free, as well as minimizing the amount
of false sharing between threads.


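    /* acc holds one counter table per thread, so no locking is needed. */
    void mem_access(unsigned long address, int threadid, char type) {
        uint64_t page = address >> page_bits;  /* page containing the address */
        acc[threadid][page][type]++;           /* type selects read or write */
    }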


Listing 1: Code executed on each memory access. Pin 
provides the address, threadid and type parameters. 
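
A self-contained variant of Listing 1 can make the per-thread counting scheme easier to experiment with. It is a minimal sketch under stated assumptions: the fixed-size array, the MAX_THREADS and MAX_PAGES limits, the access_type enum and the helper functions are all hypothetical and only stand in for whatever table the real tool uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_THREADS 4     /* illustrative limits only */
#define MAX_PAGES   1024
#define PAGE_BITS   12    /* 4 KiB pages */

enum access_type { ACC_READ = 0, ACC_WRITE = 1 };

/* One slice per thread: a thread only ever updates its own slice,
 * so the counters need no lock. */
static uint64_t acc[MAX_THREADS][MAX_PAGES][2];

void mem_access(uintptr_t address, int threadid, enum access_type type) {
    uint64_t page = (uint64_t)address >> PAGE_BITS;
    assert(threadid >= 0 && threadid < MAX_THREADS && page < MAX_PAGES);
    acc[threadid][page][type]++;
}

/* Total accesses of one type to a page, summed over all threads. */
uint64_t page_accesses(uint64_t page, enum access_type type) {
    uint64_t total = 0;
    for (int t = 0; t < MAX_THREADS; t++)
        total += acc[t][page][type];
    return total;
}

/* Replays a tiny made-up trace and returns the reads seen on page 1. */
uint64_t example_trace(void) {
    memset(acc, 0, sizeof acc);
    mem_access(0x1000, 0, ACC_READ);   /* page 1, thread 0 */
    mem_access(0x1008, 1, ACC_READ);   /* page 1, thread 1 */
    mem_access(0x2000, 1, ACC_WRITE);  /* page 2, thread 1 */
    return page_accesses(1, ACC_READ);
}
```

Summing over threads is deferred to the analysis step, which keeps the code executed on each access as short as in Listing 1.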


The instrumentation uses the Pin dynamic binary instrumentation
tool [27]. Although it is an Intel technology, it
also works on AMD processors. Previous versions of Pin
also support Intel Itanium (IA64) and ARM architectures.

Before running the application, TABARNAC retrieves 
static memory allocation information. Dynamic allocations 
are intercepted at runtime and structure names are extracted 
using the debug information provided by the compiler. Fi- 
nally, each time a thread is created, we compute its stack 
bounds and create a virtual structure named Stack#N where
N is the thread ID. Only structures that are bigger than
one page (usually 4 KiB in current x86_64 architectures) are
recorded, as our analysis granularity is the memory page.
The data structure information (name, size and address)
is only used to generate the visualization, after the end of
the instrumentation. The memory access tracing is based on
the earlier numalize tool [13], which only collected statistics
about memory accesses to pages, without information about
data structures or stacks.
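
The way an address can be attributed to a recorded structure may be sketched as follows. This is an assumption about one possible lookup, not the tool's actual code: the data_struct type, the linear search and the example table (including its addresses and the structure named tiny) are hypothetical. Only the rule that structures must be bigger than one page comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

struct data_struct {
    const char *name;   /* from debug information, or Stack#N */
    uintptr_t   addr;   /* start address */
    size_t      size;   /* size in bytes */
};

/* Index of the recorded structure containing `address`, or -1. */
int find_structure(const struct data_struct *tab, int n, uintptr_t address) {
    assert(tab != NULL && n >= 0);
    for (int i = 0; i < n; i++)
        if (tab[i].size > PAGE_SIZE &&          /* ignore structures of at most one page */
            address >= tab[i].addr &&
            address - tab[i].addr < tab[i].size)
            return i;
    return -1;
}

/* Lookup in a small made-up table. */
int example_lookup(uintptr_t address) {
    static const struct data_struct tab[] = {
        {"vz0",     0x100000, 64 * PAGE_SIZE},
        {"tiny",    0x200000, 128},
        {"Stack#0", 0x300000, 8 * PAGE_SIZE},
    };
    return find_structure(tab, 3, address);
}
```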

Figure 2: Example plots from TABARNAC.


3.2 Visualization 


Once the data collection phase is done, TABARNAC generates
the visualization (as an HTML page), providing a
summary of the trace through several plots². The visualization
aims at showing why performance issues related to
memory occur. It therefore shows several plots that help to
understand the importance of each data structure and how it
is accessed. Each plot is introduced by an explanation of its
presentation and of the common issues it can help to understand,
together with suggestions on how to fix these issues. The
visualization starts with a small introduction, summarizing
the main principles of developing for NUMA machines,
and shows the hardware topology of the analyzed machine
extracted with Hwloc [6].

After the introduction, the visualization focuses on the 
usage of data structures. Some structures are not displayed 
if less than 0.01% of the total accesses happen on them. This 
is done to make the output more readable by focusing on the 
most important structures. 
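
The display threshold can be written as a one-line test. The sketch below is a hypothetical helper, not the tool's code; it uses integer arithmetic to express the 0.01% rule and assumes access counts small enough that multiplying by 10000 does not overflow 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* A structure is displayed unless it receives less than 0.01% of all
 * accesses, i.e. it is shown when accesses / total >= 1 / 10000.
 * Assumes accesses * 10000 fits in 64 bits. */
int is_displayed(uint64_t accesses, uint64_t total) {
    assert(accesses <= total);
    return total > 0 && accesses * 10000u >= total;
}
```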

The first series of plots presents information concerning 
the relative importance of the data structures. It consists of 
two plots, showing first the size of each data structure, as 
in Figure 2(a), then the number of reads and writes in each 
structure (Figure 2(b)). These plots give a general idea of 
the structures used by the parallel application. Moreover, 
knowing the read/write behavior is very useful as it deter- 
mines the possible optimizations. For instance, structures 
written only during initialization (or very rarely) can be rel- 
atively easily duplicated, such that each NUMA node works 
on a local copy. 

The second series of plots is the most important one. It 
shows for each page of each structure which thread was re- 
sponsible for the first touch (Figure 2(c)). This information 
is important as the default policy for Linux and most other 
operating systems is to map a page as close as possible to the 
first thread accessing it. If the first touch distribution does 
not fit the actual access distribution, the default mapping 
performed by Linux might not be efficient. To address this 
issue, the developer can either correct the first touch or do
some manual data mapping to ensure better memory access
locality and balance during the execution.

² A full example of TABARNAC’s output is available at:
http://dbeniamine.github.io/Tabarnac/examples

Figure 3: Per thread access distribution inside a structure.
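
The comparison between the first-touch distribution and the actual access distribution, which the plots let the developer do visually, can also be expressed as a simple count. The sketch below is hypothetical and simplified: it compares threads rather than NUMA nodes (as if each thread ran on its own node), and the mismatch metric and the example data are ours, not something TABARNAC reports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NTHREADS 4

/* Thread that issued the most accesses to one page (lowest ID wins ties). */
int dominant_thread(const uint64_t counts[NTHREADS]) {
    int best = 0;
    for (int t = 1; t < NTHREADS; t++)
        if (counts[t] > counts[best])
            best = t;
    return best;
}

/* Number of pages whose first toucher is not their dominant accessor. */
size_t mismatched_pages(const int *first_touch,
                        const uint64_t (*counts)[NTHREADS],
                        size_t npages) {
    size_t bad = 0;
    for (size_t p = 0; p < npages; p++)
        if (first_touch[p] != dominant_thread(counts[p]))
            bad++;
    return bad;
}

/* Made-up case: thread 0 initialises every page, but each page is
 * then used mostly by a different thread. */
size_t example_mismatch(void) {
    const int first_touch[4] = {0, 0, 0, 0};
    const uint64_t counts[4][NTHREADS] = {
        {90, 1, 0, 0},
        {1, 80, 0, 0},
        {0, 0, 70, 2},
        {0, 0, 3, 60},
    };
    return mismatched_pages(first_touch, counts, 4);
}
```

A high mismatch count in such a check corresponds to the situation where the default mapping is likely to be inefficient.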

Finally, TABARNAC shows the density of accesses per- 
formed by each thread and the global distribution. In the 
example shown in Figure 3, each horizontal line represents 
the number of accesses to one page; there is one line per
thread and one for the average number of accesses. More- 
over, for each thread the average number of accesses to the 
structure is displayed. Darker lines indicate more memory 
accesses to the page. This visualization gives an easy way 
to understand the data sharing between threads, as well as 
the balance between pages and threads. These plots can be 
used to identify inefficient memory access behaviors and to 
determine the best NUMA mapping policy. 


4. EXPERIMENTAL SETUP 


This section briefly discusses our experimental setup for 
the evaluation of TABARNAC. 

We used two NUMA machines for our experiments, Turing
and Idfreeze. The second machine was only used to compare
the instrumentation overhead on Intel and AMD machines;
all the other experiments ran on Turing. The hardware
details are summarized in Table 1. Turing runs version
3.13 of the Linux kernel, while Idfreeze runs version
3.2.

                Machine    Vendor   Model
CPU             Turing     Intel    Xeon X7550
                Idfreeze   AMD      Opteron 6174

                Machine    Nodes   Threads   Freq       Memory
System totals   Turing     4       64        2.00 GHz   128 GiB
                Idfreeze   8       48        2.20 GHz   256 GiB

                Machine    Cores   Threads   L3 Cache   Memory
Per node        Turing     8       16        18 MiB     32 GiB
                Idfreeze   6       6         12 MiB     32 GiB

Table 1: Hardware configuration of our evaluation systems.

All applications use OpenMP for parallelization. They were
compiled with gcc, version 4.6.3, with the -O2 optimization
flag. Both analysis and performance evaluation are
performed with 64 threads, which is the maximum num- 
ber of threads that our main evaluation machine (Turing) 
can execute in parallel. 

In the performance evaluation, we compare the following 
three traditional mapping policies to the version modified 
using the knowledge provided by TABARNAC. The original 
Linux kernel is our baseline for the experiments. We use an 
unmodified Linux kernel, version 3.13, with the first-touch 
policy. The NUMA Balancing mechanism is disabled in this 
baseline. The interleave policy is performed with the help of 
the numactl tool [22]. We also compare our results to the
recently introduced NUMA Balancing technique [7], which 
is executed with its default configuration. 

For the plots presenting speedups, each configuration was 
executed at least 10 times. Each point shows the arithmetic 
mean of all runs. The error bars in those plots represent the 
standard error. 
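
For reference, the two statistics can be written as follows for n runs with execution times t_1, ..., t_n. We assume here the usual sample standard deviation (with n - 1 in the denominator); the symbols are introduced only for this illustration.

```latex
\bar{t} = \frac{1}{n}\sum_{i=1}^{n} t_i, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(t_i - \bar{t}\right)^2}, \qquad
\mathrm{SE} = \frac{s}{\sqrt{n}}, \qquad n \ge 10
```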


5. ANALYSIS AND RESULTS 


This section presents the results of our analysis. For each 
application, we show its memory access behavior, discuss 
strategies to optimize this behavior and present the perfor- 
mance improvements that can be achieved. 


5.1 Ondes3D 


Ondes3D is the main numerical kernel of the Ondes3D application
[17]. It simulates the propagation of seismic waves
using a finite-differences numerical method. Ondes3D has a
memory usage of 11.3 GiB with the parameters used for the
performance evaluation, and 0.7 GiB for the analysis.

The analysis of the access distribution in Ondes3D (not
displayed here) shows that each structure seems to be well
distributed between the threads. However, for all structures,
thread 0 is responsible for all first accesses, as we can see for
vz0 in Figure 4(a). Due to this pattern, if we run Ondes3D
without any improved mapping policy, every page will be
mapped to the NUMA node that executes thread 0, resulting
in mostly remote accesses for the other threads. An
easy fix is to perform the initialization in parallel and to 
pin each thread on a different core, or to use the interleave 
policy. Such a modification results in the first touch distribution
shown in Figure 4(b), which is now distributed
among all the threads.

Figure 4: First-touch for structure vz0 from Ondes3D (x-axis:
page number, y-axis: thread ID). (a) Original first-touch.
(b) Improved first-touch.
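
The parallel initialization can be sketched with OpenMP as follows. This is not the Ondes3D code: the function names, the array size and the placeholder update are hypothetical, and the sketch assumes that the initialization and compute loops use the same static schedule and that threads are pinned (for instance through the OMP_PROC_BIND environment variable). Compiled without OpenMP support, the pragmas are ignored and the code simply runs sequentially.

```c
#include <assert.h>
#include <stdlib.h>

/* Initialise in parallel: with first-touch, each page is placed on the
 * node of the thread that writes it here. */
void init_parallel(double *vz0, long n) {
    assert(vz0 != NULL && n >= 0);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        vz0[i] = 0.0;
}

/* Compute loop with the same static schedule, so that each thread
 * mostly works on the pages it touched first. */
void compute_step(double *vz0, long n) {
    assert(vz0 != NULL && n >= 0);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        vz0[i] += 1.0;   /* placeholder for the real update */
}

/* Returns the last element after one step, or -1.0 if allocation fails. */
double example_first_touch(void) {
    const long n = 1L << 16;
    double *vz0 = malloc((size_t)n * sizeof *vz0);
    if (vz0 == NULL)
        return -1.0;
    init_parallel(vz0, n);
    compute_step(vz0, n);
    double last = vz0[n - 1];
    free(vz0);
    return last;
}
```

The key point is that the first write to each page is done by a thread that later works on it, instead of by thread 0 alone.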

Figure 5: Speedup for Ondes3D compared to the baseline.


We compare the performance of the modified version (First 
Touch) to the original (Base) version, running on the normal 
OS, with NUMA Balancing activated and with an Interleave 
policy. Figure 5 presents the results of this evaluation. We
can see that all methods improve the execution time compared
to the normal OS, but NUMA Balancing provides less than
30% speedup, while the static mappings (Interleave and the
modified code) increase performance by 60%. Indeed, with
NUMA Balancing, all pages are initially mapped by the OS
to the NUMA node of thread 0, and are only moved later
on, after many remote accesses have already occurred, losing
some optimization opportunities. This is a case where
static mapping can be substantially better than automated
tools. The Interleave policy provides a speedup similar to
First Touch since it distributes the pages over the NUMA
nodes at the beginning of the execution, but our tool shows
clearly the cause of the performance issue.


5.2 The IS Benchmark 


We executed TABARNAC on the benchmarks from the
OpenMP implementation of the NAS Parallel Benchmark
suite (NPB) [21]. Most of them have either a well-balanced
access pattern between the threads or a totally random
access distribution. For all of them, the first touch fits
exactly the access distribution. However, the analysis of
IS caught our attention. IS sorts a set of integer numbers
using a parallel bucket sort algorithm. According to the
NAS website, IS has a random memory access pattern,
while we observed a very specific pattern. In this section
we explain this pattern and how we used it to improve the
performance of IS.

Figure 6: Memory access distribution for the main structures
of IS. Original behavior on the top, modified on the bottom.

IS was executed with input class D for the performance
evaluation, resulting in a memory usage of 33.5 GiB, and class
B for the analysis, with a memory usage of 0.25 GiB.

The top of Figure 6 shows the original access distributions
for the three main structures of IS. We can see that each
structure has a different access pattern: the access distribution
of key_array (Figure 6(a)) shows that each thread works
on a different part of the structure, which allows automated
tools to perform an efficient data/thread mapping on
it. On the other hand, key_buff2 (Figure 6(b)) is completely
shared by all threads. The access distribution of key_buff1
(Figure 6(c)) is the most interesting one. We can
see that almost all accesses occur in pages in the middle of
the structure (from page 500 to 1500), and those pages are
shared by all threads. This means that the number of accesses
per page for each thread follows a Gaussian distribution
centered in the middle of the structure.

We can identify the source of this pattern in the IS source 
code. Indeed, all the accesses to key_buff1 are linear, 
except in one OpenMP parallel loop where they depend on 
the value of key_buff2. 

As we noticed in Figure 6(c) that the values of key_buff2 
follow a Gaussian distribution, we can design a distribution
of the threads that provides both a good load balancing and 
locality of data. By default, OpenMP threads are sched- 
uled dynamically to avoid unbalanced distribution of work, 
but the developers also propose a cyclic distribution of the 
threads over the loop. For our distribution, we split the 
loop into two equal parts and distribute each part among 
the threads in a round-robin way. This modification can 
be done by simply changing one line of code, the #pragma 
omp before the parallel loop. 
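
One plausible way to express this distribution is a static OpenMP schedule whose chunk size is the loop length divided by twice the number of threads, since static chunks are handed out to threads in round-robin order. The sketch below is our reading of the description, not the exact pragma used in IS; the function names and the placeholder loop body are hypothetical.

```c
#include <assert.h>

/* Chunk size that cuts n iterations into 2 * nthreads chunks
 * (at least one iteration per chunk). */
long cyclic_split_chunk(long n, int nthreads) {
    assert(n >= 0 && nthreads > 0);
    long chunk = n / (2L * nthreads);
    return chunk > 0 ? chunk : 1;
}

/* Thread that executes iteration i under schedule(static, chunk):
 * chunks are assigned to threads in round-robin order, so each thread
 * gets one chunk in each half of the loop when 2 * nthreads divides n. */
int cyclic_split_owner(long i, long n, int nthreads) {
    long chunk = cyclic_split_chunk(n, nthreads);
    return (int)((i / chunk) % nthreads);
}

/* Hypothetical bucket loop using that schedule. */
void process_buckets(long *work_done, long n_buckets, int nthreads) {
    long chunk = cyclic_split_chunk(n_buckets, nthreads);
    #pragma omp parallel for schedule(static, chunk) num_threads(nthreads)
    for (long b = 0; b < n_buckets; b++)
        work_done[b] += 1;   /* placeholder for the per-bucket work */
}
```

With 16 iterations and 4 threads, for example, thread 0 executes iterations 0-1 and 8-9: one contiguous block in each half of the loop, which pairs a lightly loaded block with a heavily loaded one when the work is concentrated in the middle.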

With this code modification, we obtain the access distri- 
bution shown in the bottom of Figure 6. We can see that 
now each thread accesses a different part of key_buff1. 
Furthermore, although most of the accesses still occur in the middle
of the structure, the average number of accesses across the
structure is the same for all threads, which means that our
distribution preserves the good load balancing. Our modification
has also changed the access distribution of key_buff2.
We can see that each thread uses mostly one part of the 
array and again the load balance is preserved. 

The main point of our code modification is to improve the 
affinity between threads and memory; we therefore need to
pin each thread on a core to keep them close to the data 
they access. TABARNAC also shows us that the first touch 
is always done by the thread actually using the data for IS, 
therefore we do not need to explicitly map the data to the 
NUMA nodes. 

We compare the execution time of IS (class D) for the
three scheduling methods: Dynamic, Cyclic with a step of 1,
and Cyclic-Split (cyclic with the proposed distribution). For
the first two methods, we compare the execution time on
the base operating system, with the interleave policy and with
NUMA Balancing enabled. As we map threads manually,
interleave and NUMA Balancing are not relevant with our
modifications and are therefore not evaluated.

Figure 7: Speedup for IS (class D) compared to the baseline.

Figure 7 shows the speedup of IS compared to the default 
version (Dynamic) for each scheduling method and for each 
optimization technique. The first thing to notice is that 
with the default Dynamic scheduling, both Interleave and 
NUMA Balancing slow the application down, by up to 10%. 
This shows that simple optimization policies can actually 
reduce performance for NUMA-unaware code. The Cyclic 
scheduling, proposed in the original code, already provides 
a speedup of up to 13%. We can see that both interleave and
NUMA Balancing are not suitable for this scheduling, since
they reduce the performance gains. The Cyclic-Split version
provides a speedup of more than 20% with a very small
code modification. This example shows how analyzing an 
application’s memory behavior can lead to significant execu- 
tion time improvement on an already optimized application 
where automatic techniques can actually slow the applica- 
tion down. 


5.3 Overhead Analysis 


Our last experiment aims at evaluating the instrumen- 
tation cost of TABARNAC. To do so, we executed all of 
the NAS Parallel Benchmarks in class B with 64 threads 
on both evaluation systems and compared the original exe- 
cution time to the execution time with instrumentation en- 
abled. 

As we can see in Figure 8, on the Intel machine, the in- 
strumentation slows the execution down by a factor from 
10 to 30. On the AMD machine, the overhead is almost always
higher and, in pathological cases, execution is two to three times
slower than on the Intel machine. Although this overhead
is not negligible, we have to consider the fact that often we 
can instrument smaller versions of the applications, as we fo- 
cus on the general behavior. Moreover our method is more 
precise than sampling and thus one run is often enough. 
Finally, as our analysis is designed to be used during the development
phase and not at runtime in an automated tool, we
consider that this overhead is acceptable.

Figure 8: TABARNAC’s instrumentation overhead.


5.4 Summary of Results 


Our experiments have highlighted the fact that although 
automated tools such as NUMA Balancing can be efficient, 
in some cases they result in performance losses. Moreover, 
although simple static mapping policies can result in substantial
improvements, the best policy depends on the memory
access behavior of the parallel application. Therefore,
it is necessary to understand this behavior to select the most 
appropriate mapping policy. 

Our tools and methodology enable developers and users
to achieve performance improvements in two ways. First,
by providing a deep understanding of the memory access
behavior, they enable the user to find the best mapping policy.
Second, this knowledge can be used to identify and fix
inefficient memory behavior. Our experiments showed that 
both situations result in significant performance gains. 


6. CONCLUSIONS AND FUTURE WORK 


In this paper, we presented TABARNAC, a set of tools to 
analyze and optimize the memory behavior of parallel appli- 
cations running on NUMA machines. We provide a custom 
memory tracer based on the Pin dynamic binary instrumen- 
tation tool which records the number of memory reads and 
writes performed by all threads for each data structure. The 
advantage of instrumentation is that it is the most accurate 
and portable way to generate memory traces. Despite the 
overhead caused by the instrumentation, our tool is efficient 
enough to analyze even huge applications in a reasonable 
time. 

While other tools show how many remote accesses are triggered
by which NUMA node, line of code or data structure,
we provide information on how data structures are accessed. 
This information allows the user to understand why perfor- 
mance issues occur. TABARNAC presents this information 
through several meaningful yet readable plots. Each plot 
is preceded by explanations on how to read it, what kind 
of memory access issues it can help to identify and how to 
solve them. 

We analyzed two parallel applications with TABARNAC:
Ondes3D, a real-life application that simulates seismic waves,
and IS from the NAS Parallel Benchmarks, which is known
for being memory intensive with a random memory access
pattern. For both applications, TABARNAC helped us understand
their performance issues. Using this knowledge, we
proposed simple code modifications to optimize the memory
behavior, resulting, for each application, in significant
speedups compared to the original version (up to 60%).
Improvements were also substantially higher than those provided
by automated tools.

Future work will move in two directions. First, we will 
improve the structure detection support to be able to an- 
alyze Fortran programs, as many scientific applications are 
written in Fortran. Second, we will improve the detection 
of inefficient memory access behavior, such as an all-to-all 
sharing, to make the analysis partly automatic. 